Suffix Array of Alignment: A Practical Index for Similar Data

Authors

Joong Chae Na

Heejin Park

Sunho Lee

Minsung Hong

Thierry Lecroq

Laurent Mouchard

Kunsoo Park

Abstract

The suffix tree of alignment is an index data structure for similar strings. Given an alignment of similar strings, it stores all suffixes of the alignment, called alignment-suffixes. An alignment-suffix represents either one suffix of a single string or the suffixes of multiple strings that start at the same position in the alignment. In theory, the suffix tree of alignment exploits the similarity among the strings well. However, suffix trees are not widely used in biological applications because of their huge space requirements, and suffix arrays are used in practice instead. In this paper we propose a space-economical version of the suffix tree of alignment, named the suffix array of alignment (SAA). Given an alignment ρ of similar strings, the SAA for ρ is a lexicographically sorted list of all the alignment-suffixes of ρ. The SAA supports pattern search as efficiently as the generalized suffix array. Our experiments show that, when indexing 11 human genome sequences, our index uses only 14% of the space used by the generalized suffix array. The space efficiency of our index improves as the number of genome sequences increases. We also present an efficient algorithm for constructing the SAA.
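For orientation, the baseline that the abstract compares against, the generalized suffix array, can be sketched in a few lines. This is a minimal illustration and not the authors' SAA: it builds the array by naive sorting, stores one entry per suffix of every string (exactly the redundancy the SAA is meant to avoid for regions the strings share), and answers pattern queries by binary search. The function names `build_generalized_suffix_array` and `search` are hypothetical, chosen for this example only.

```python
from typing import List, Tuple

Entry = Tuple[int, int]  # (index of the string, offset of the suffix in it)


def build_generalized_suffix_array(strings: List[str]) -> List[Entry]:
    """Naive construction: list every suffix of every string, then sort.

    Assumption for illustration only; practical indexes use far faster
    and more compact construction algorithms.
    """
    entries = [(i, j) for i, s in enumerate(strings) for j in range(len(s))]
    entries.sort(key=lambda e: strings[e[0]][e[1]:])
    return entries


def search(strings: List[str], sa: List[Entry], pattern: str) -> List[Entry]:
    """Return all (string index, offset) occurrences of pattern."""
    m = len(pattern)

    def prefix(k: int) -> str:
        # First m characters of the k-th smallest suffix.
        i, j = sa[k]
        return strings[i][j:j + m]

    # Lower bound: first entry whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first entry whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    # Two similar strings that differ in a single position.
    strings = ["acgtacgt", "acgtccgt"]
    sa = build_generalized_suffix_array(strings)
    print(len(sa))                      # one entry per suffix of each string
    print(search(strings, sa, "cgt"))   # occurrences in both strings
    print(search(strings, sa, "tcc"))   # occurs only in the second string
```

In this toy input the two strings differ in one character, yet the array holds 16 entries because every shared suffix is stored once per string. The SAA described above instead sorts alignment-suffixes, so suffixes that start at the same alignment position can be represented together.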

Similar resources

FM-index of alignment with gaps

Recently, a compressed index for similar strings, called the FM-index of alignment (FMA), has been proposed with the functionalities of pattern search and random access. The FMA is quite efficient in space requirement and pattern search time, but it is applicable only for an alignment of similar strings without gaps. In this paper we propose the FM-index of alignment with gaps, a realistic inde...

Implementation and performance analysis of efficient index structures for DNA search algorithms in parallel platforms

Because of the large datasets that are usually involved in deoxyribonucleic acid (DNA) sequence alignment, the use of optimal local alignment algorithms (e.g., Smith–Waterman) is often unfeasible in practical applications. As such, more efficient solutions that rely on indexed search procedures are often preferred to significantly reduce the time to obtain such alignments. Some data structures ...

Compressed and Searchable Indexes for Highly Similar Strings (Invited Talk)

The collection indexing problem is defined as follows: Given a collection of highly similar strings, build a compressed index for the collection of strings, and when a pattern is given, find all occurrences of the pattern in the given strings. Since the index is compressed, we also need a separate operation which retrieves a specified substring of one of the given strings. Such a collection of ...

Practical aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences

Searching patterns in the DNA sequence is an important step in biological research. To speed up the search process, one can index the DNA sequence. However, classical indexing data structures like suffix trees and suffix arrays are not feasible for indexing DNA sequences due to main memory requirement, as DNA sequences can be very long. In this paper, we evaluate the performance of two compress...

Suffix Tree of Alignment: An Efficient Index for Similar Data

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A|+ |B| leaves and can be constructed in O(|A|+ |B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not ex...

Publication date: 2013
